6 research outputs found

    Face Recognition with Attention Mechanisms

    Face recognition has been widely used in people’s daily lives due to its contactless process and high accuracy. Existing works can be divided into two categories: global and local approaches. Mainstream global approaches usually extract features from whole faces. However, global faces tend to suffer from dramatic appearance changes under large pose variations, heavy occlusions, and so on. On the other hand, since some local patches may remain similar, they can play an important role in such scenarios. Existing local approaches mainly rely on cropping local patches around facial landmarks and then extracting the corresponding local representations. However, facial landmark detection may be inaccurate or even fail, which limits their applications. To address this issue, attention mechanisms are applied to automatically locate discriminative facial parts while suppressing noisy ones. Following this motivation, several models are proposed: the Local multi-Scale Convolutional Neural Networks (LS-CNN), Hierarchical Pyramid Diverse Attention (HPDA) networks, Contrastive Quality-aware Attentions (CQA-Face), Diverse and Sparse Attentions (DSA-Face), and Attention Augmented Networks (AAN-Face). Firstly, a novel spatial attention module (local aggregation networks, LANet) is proposed to adaptively locate useful facial parts. Meanwhile, different facial parts may appear at different scales due to pose variations and expression changes. To solve this issue, the LS-CNN is proposed to extract discriminative local information at different scales. Secondly, it is observed that some important facial parts may be neglected without proper guidance. Besides, hierarchical features from different layers, which contain rich low-level and high-level information, are not fully exploited. To overcome these two issues, the HPDA is proposed.
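The LANet spatial attention described above can be illustrated with a minimal NumPy sketch. The two-step design (a 1x1 convolution for channel reduction, then a 1x1 aggregation and a sigmoid producing a spatial map) follows the general local-aggregation idea; the function name, reduction ratio, and random weights here are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lanet_spatial_attention(feat, w1, w2):
    """LANet-style spatial attention (sketch).

    feat: (C, H, W) feature map
    w1:   (C//r, C) weights of the first 1x1 conv (channel reduction)
    w2:   (1, C//r) weights of the second 1x1 conv (aggregation)
    Returns the reweighted feature map of shape (C, H, W).
    """
    C, H, W = feat.shape
    x = feat.reshape(C, -1)              # (C, H*W): a 1x1 conv is a matmul over channels
    hidden = np.maximum(w1 @ x, 0.0)     # ReLU after channel reduction
    attn = sigmoid(w2 @ hidden)          # (1, H*W): spatial attention map in (0, 1)
    out = x * attn                       # broadcast: emphasize useful locations
    return out.reshape(C, H, W)

# toy usage with C = 8 channels and reduction ratio r = 4
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((1, 2)) * 0.1
out = lanet_spatial_attention(feat, w1, w2)
print(out.shape)  # (8, 4, 4)
```

Because the sigmoid keeps the map in (0, 1), the module can only down-weight locations, never amplify them, which matches its role of suppressing noisy parts.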
Specifically, a diverse-learning strategy is proposed to enlarge the Euclidean distance between every pair of spatial attention maps, locating diverse facial parts. Besides, hierarchical bilinear pooling is adopted to effectively combine features from different layers. Thirdly, despite the decent performance of the HPDA, the Euclidean distance may not be flexible enough to control the distances between attention maps. Further, it is also important to assign different quality scores to various local patches, because different facial parts carry information of different importance, especially for faces with heavy occlusions, large pose variations, or quality changes. The CQA-Face is therefore proposed, consisting mainly of contrastive attention learning and quality-aware networks: the former introduces a better distance function to enlarge the distances between every pair of attention maps, while the latter applies a graph convolutional network to learn the relations among different facial parts, assigning higher quality scores to important patches and lower scores to less useful ones. Fourthly, the attention subset problem may occur, where some attention maps are subsets of other attention maps. Consequently, the learned facial parts are not diverse enough to cover every facial detail, leading to inferior results. In our DSA-Face model, a new pairwise self-contrastive attention is proposed, which considers the complement of subset attention maps in the loss function to address this problem. Moreover, an attention sparsity loss is proposed to suppress the responses in noisy image regions, especially for masked faces. Lastly, in existing popular face datasets, some characteristics of facial images (e.g. frontal faces) are over-represented, while others (e.g. profile faces) are under-represented. In the AAN-Face model, attention erasing is proposed to simulate various occlusion levels.
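The attention erasing just mentioned can be sketched as a simple training-time augmentation. Whether whole maps or sub-regions are erased, and the erasing probability, are assumptions made here for illustration; the point is that the surviving branches must compensate for the "occluded" ones.

```python
import numpy as np

def attention_erasing(attn_maps, rng, erase_prob=0.5):
    """Attention-erasing augmentation (sketch): randomly zero out attention
    maps during training to simulate occlusion, forcing the remaining
    branches to cover otherwise under-represented facial parts.

    attn_maps: (K, H, W). Returns a copy with some maps erased.
    """
    out = attn_maps.copy()
    for k in range(out.shape[0]):
        if rng.random() < erase_prob:   # independently erase each map
            out[k] = 0.0
    return out

rng = np.random.default_rng(42)
maps = np.ones((4, 2, 2))
erased = attention_erasing(maps, rng)
print(erased.sum(axis=(1, 2)))  # per-map sums: 0.0 where erased, 4.0 otherwise
```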
Besides, an attention center loss is proposed to control the responses on each attention map, guiding each map to consistently focus on a similar facial part. Our works have greatly improved the performance of cross-pose, cross-quality, cross-age, cross-modality, and masked face matching tasks.
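A common thread in these models is a loss term that pushes attention maps apart. A minimal sketch of the diverse-learning objective follows, assuming flattened attention maps and the mean pairwise squared Euclidean distance; the exact normalization and sign convention are illustrative assumptions.

```python
import numpy as np

def diversity_loss(attn_maps):
    """Diverse-learning loss (sketch): encourage large Euclidean distances
    between every pair of spatial attention maps.

    attn_maps: (K, H*W) array, one flattened attention map per row.
    Returns the negative mean pairwise squared Euclidean distance, so
    minimizing this loss pushes the maps apart.
    """
    K = attn_maps.shape[0]
    total, pairs = 0.0, 0
    for i in range(K):
        for j in range(i + 1, K):
            total += np.sum((attn_maps[i] - attn_maps[j]) ** 2)
            pairs += 1
    return -total / pairs

# identical maps give the maximal (worst) loss; disjoint maps give a lower one
maps_similar = np.ones((3, 16))
maps_diverse = np.eye(3, 16)     # three non-overlapping one-hot maps
print(diversity_loss(maps_diverse) < diversity_loss(maps_similar))  # True
```

The CQA-Face and DSA-Face refinements described above replace this plain Euclidean term with more flexible distance functions, but the pairwise push-apart structure is the same.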

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Inspired by the fact that human brains emphasize discriminative parts of the input and suppress irrelevant ones, substantial local mechanisms have been designed to boost the development of computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. Local mechanisms have different characteristics across application scenarios and paradigms. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. A categorization of local mechanisms in each field is summarized. Then, the advantages and disadvantages of every category are analyzed in depth, leaving room for further exploration. Finally, future research directions for local mechanisms are discussed that may benefit future works. To the best of our knowledge, this is the first survey of local mechanisms in computer vision. We hope that this survey can shed light on future research in the computer vision field.

    Vision Transformer with Attentive Pooling for Robust Facial Expression Recognition

    Facial Expression Recognition (FER) in the wild is an extremely challenging task. Recently, some Vision Transformers (ViT) have been explored for FER, but most of them perform worse than Convolutional Neural Networks (CNN). This is mainly because the newly proposed modules are difficult to train to convergence from scratch, owing to their lack of inductive bias, and tend to focus on occluded and noisy areas. TransFER, a representative transformer-based method for FER, alleviates this with multi-branch attention dropping but incurs excessive computation. In contrast, we present two attentive pooling (AP) modules to pool noisy features directly. The AP modules include Attentive Patch Pooling (APP) and Attentive Token Pooling (ATP). They aim to guide the model to emphasize the most discriminative features while reducing the impact of less relevant ones. The proposed APP is employed to select the most informative patches on CNN features, and ATP discards unimportant tokens in the ViT. Being simple to implement and free of learnable parameters, APP and ATP intuitively reduce the computational cost while boosting performance by pursuing ONLY the most discriminative features. Qualitative results demonstrate the motivations and effectiveness of our attentive pooling modules. Besides, quantitative results on six in-the-wild datasets outperform other state-of-the-art methods. Comment: Codes will be public on https://github.com/youqingxiaozhua/APVi
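The ATP idea of keeping only the highest-scoring tokens can be sketched in a few lines of NumPy. The source of the scores (e.g. class-token attention) and the number of tokens kept are assumptions here; note that, as the abstract states, the operation itself has no learnable parameters.

```python
import numpy as np

def attentive_token_pooling(tokens, scores, keep):
    """ATP-style token pooling (sketch): keep only the `keep` tokens with
    the highest attention scores and discard the rest, cutting the cost
    of subsequent transformer layers.

    tokens: (N, D) token embeddings; scores: (N,) per-token importance.
    Returns a (keep, D) array, preserving the original token order.
    """
    idx = np.argsort(scores)[-keep:]   # indices of the top-`keep` scores
    idx = np.sort(idx)                 # restore the original token order
    return tokens[idx]

tokens = np.arange(12.0).reshape(6, 2)                 # 6 tokens, dim 2
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])      # hypothetical attention scores
pooled = attentive_token_pooling(tokens, scores, keep=3)
print(pooled.shape)  # (3, 2)
```

Dropping tokens shrinks the sequence length for every later layer, which is where the computational savings claimed above come from.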